Skip to content

Add poison message handling to the dispatchers#1366

Open
sophiatev wants to merge 22 commits into
mainfrom
stevosyan/add-poison-message-handling
Open

Add poison message handling to the dispatchers#1366
sophiatev wants to merge 22 commits into
mainfrom
stevosyan/add-poison-message-handling

Conversation

@sophiatev

Copy link
Copy Markdown
Contributor

This PR introduces poison message handling to the dispatchers. This is done by

  1. Introducing a new interface IPoisonMessageHandler that the any orchestration service which has poison message handling is expected to implement
  2. In the case of data corruption, the dispatchers will invoke this interface (if provided) to handle the message rather than throwing an exception
  3. Otherwise, the dispatchers will invoke this interface to determine if a message is "poisoned", and if so take remediating action if possible (for example failing the orchestration or entity call).

The orchestration service is otherwise responsible for determining what to do with the poison message(s) and how to store them.

Copilot AI review requested due to automatic review settings June 10, 2026 16:47
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs Fixed
Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs Fixed
this.Reason = reason;
}

// Private ctor for JSON deserialization (required by some storage providers and out-of-proc executors)

@sophiatev sophiatev Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR but I bug I found when testing (JSON was not able to deserialize this event because it lacked a 0-arg constructor and the other constructors all had multiple parameters)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also unrelated to this PR, but I realized while working on it that this code I wrote a while back had some incorrect assumptions so I took the opportunity to fix it

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an extensibility hook (IPoisonMessageHandler) and integrates poison/invalid message detection into the core dispatchers so that corrupted or “poisoned” inputs can be handled deterministically (e.g., fail orchestration/activity/entity work) instead of always throwing.

Changes:

  • Introduces IPoisonMessageHandler and wires it into orchestration/activity/entity dispatchers for invalid work items and poison message detection.
  • Adds structured logging support for poison-message detection (new event ID + event source + log event).
  • Adds dispatch-count tracking on history events and propagates poison metadata through entity request processing.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/DurableTask.Core/Tracing/TraceHelper.cs Adjusts entity invocation activity ending to better handle partial result sets.
src/DurableTask.Core/TaskOrchestrationDispatcher.cs Adds poison detection/handling and updates reconciliation to return a drop reason.
src/DurableTask.Core/TaskEntityDispatcher.cs Adds poison detection/handling for entity messages, plus poison-aware batching/result shaping.
src/DurableTask.Core/TaskActivityDispatcher.cs Adds poison/invalid handling for activity scheduling messages (including failing poisoned tasks).
src/DurableTask.Core/Logging/StructuredEventSource.cs Adds a new structured event for poison message detection.
src/DurableTask.Core/Logging/LogHelper.cs Adds PoisonMessageDetected helper overloads emitting structured logs.
src/DurableTask.Core/Logging/LogEvents.cs Adds a new structured log event type for poison messages.
src/DurableTask.Core/Logging/EventIds.cs Reserves a new event ID for poison message detection.
src/DurableTask.Core/IPoisonMessageHandler.cs New interface defining poison detection and handling hooks.
src/DurableTask.Core/History/HistoryEvent.cs Adds DispatchCount to history events for poisoning heuristics/telemetry.
src/DurableTask.Core/History/ExecutionRewoundEvent.cs Adds a parameterless ctor for JSON deserialization compatibility.
src/DurableTask.Core/Entities/OrchestrationEntityContext.cs Adds AbandonAcquire() to reset lock acquisition state on failure.
src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs Adds poison metadata fields used during entity request processing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs Outdated
… combined'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 10, 2026 17:00
… combined'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs Outdated

@cgillum cgillum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments. I haven't gone through the dispatcher code yet (those are bigger diffs).

/// If the request message is poisoned, the reason it is poisoned.
/// Otherwise, null.
/// </summary>
public string? PoisonReason { get; set; }

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for PoisonReason in entity request messages?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used when generating the failure response for a poisoned entity operation, i.e. here.

Since processing the history event that corresponds to the entity request is decoupled from sending the response, we need to store the poison reason in the request message so we can use it later to populate the failure details.

Comment thread src/DurableTask.Core/Logging/LogEvents.cs Outdated
Comment thread src/DurableTask.Core/Logging/StructuredEventSource.cs Outdated
/// <param name="historyEvent">The history event being dispatched.</param>
/// <param name="reason">Why the message is considered poisoned.</param>
/// <returns><c>true</c> if the message should be treated as poisoned; otherwise <c>false</c>.</returns>
public bool IsPoisonMessage(HistoryEvent historyEvent, out string? reason);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the backend to decide whether something is a poison message? Shouldn't the dispatcher be making this decision?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't, and it definitely can. I thought it would be the responsibility of the "poison message handler" since it's sort of in the name but this was just a somewhat arbitrary choice on my end. I can return responsibility to the dispatchers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's tricky about reviewing this interface is that we don't yet have any implementations of it. I'm thinking we should probably implement this for Azure Storage before committing to these new public APIs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually one thing I remembered is that doing it this way meant I didn't have to pass a maxDispatchCount when instantiating the dispatchers, which meant we didn't need yet another overload for the TaskHubWorker to accept this parameter when it creates them. But if we want to keep this decision on the dispatcher side and avoid another parameter, we can maybe expose this via IPoisonMessageHandler.MaxDispatchCount, if that sounds reasonable?

Comment thread src/DurableTask.Core/IPoisonMessageHandler.cs
Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs Outdated
Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs Outdated

@cgillum cgillum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing. Just a few more comments (and some responses).

Comment thread src/DurableTask.Core/TaskActivityDispatcher.cs Outdated
Comment thread src/DurableTask.Core/TaskActivityDispatcher.cs Outdated
/// <param name="historyEvent">The history event being dispatched.</param>
/// <param name="reason">Why the message is considered poisoned.</param>
/// <returns><c>true</c> if the message should be treated as poisoned; otherwise <c>false</c>.</returns>
public bool IsPoisonMessage(HistoryEvent historyEvent, out string? reason);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's tricky about reviewing this interface is that we don't yet have any implementations of it. I'm thinking we should probably implement this for Azure Storage before committing to these new public APIs.

@@ -1,4 +1,4 @@
// ----------------------------------------------------------------------------------
// ----------------------------------------------------------------------------------

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to have @sebastianburckhardt review the changes to this file as I'm less familiar with the details of entity dispatching.

Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs Outdated
Copilot AI review requested due to automatic review settings June 12, 2026 01:38

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Comment thread src/DurableTask.Core/TaskOrchestrationDispatcher.cs
Comment thread src/DurableTask.Core/Logging/StructuredEventSource.cs Outdated
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs
Comment thread src/DurableTask.Core/TaskEntityDispatcher.cs
Comment thread src/DurableTask.Core/Logging/LogEvents.cs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants